8.5  Advanced In Silico Analysis Tools

of PCA enables identification of the principal directions in which the data vary. In essence, the principal components, or normal modes, of a general dataset are axes in multidimensional space along which the data show the greatest variation. There can be several principal components, depending on the number of independent (orthogonal) features in the data. For images, this amounts to an efficient method of data compression: the key information is projected onto typically just a few tens of principal components that encapsulate the essential features of an image.

In computational terms, the principal components are found by taking a stack of similar images containing the same features and then calculating the eigenvectors and eigenvalues of the data covariance matrix for the image stack. An eigenvector

that has the largest eigenvalue is equivalent to the direction of the greatest variation, and the

eigenvector with the second largest eigenvalue is an orthogonal direction that has the next

highest variation, and so forth for higher orders of variation beyond this. These eigenvectors

can then be combined in a weighted sum to generate a compressed version of any given image: the more eigenvectors included, the lower the compression but the better the quality of the compressed image.
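A minimal sketch of this eigendecomposition step, using NumPy on a toy image stack (the function name and synthetic data here are illustrative, not a specific published implementation), might look as follows:

```python
import numpy as np

def stack_principal_components(stack):
    """Eigenvalues/eigenvectors of the pixel covariance matrix of an image
    stack of shape (n_images, height, width), sorted so the eigenvector
    with the largest eigenvalue (greatest variation) comes first."""
    n, h, w = stack.shape
    X = stack.reshape(n, h * w).astype(float)
    X = X - X.mean(axis=0)                    # centre each pixel over the stack
    # For n << h*w it is cheaper to diagonalise the small n x n Gram matrix
    # and map its eigenvectors back into pixel space afterwards.
    vals, vecs = np.linalg.eigh(X @ X.T / n)  # eigh returns ascending order
    order = np.argsort(vals)[::-1]            # largest eigenvalue first
    vals, vecs = vals[order], vecs[:, order]
    comps = X.T @ vecs                        # pixel-space eigenvectors
    norms = np.linalg.norm(comps, axis=0)
    comps = comps / np.where(norms > 1e-12, norms, 1.0)  # unit length
    return vals, comps
```

For a stack whose dominant variation is, say, the overall brightness of a common feature, the first returned eigenvalue greatly exceeds the rest, and the corresponding eigenvector resembles that feature.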

Each image of area i × j pixels may be depicted as a vector in (i × j)-dimensional hyperspace, whose coordinates are defined by the pixel intensity values. A stack of such

images is thus equivalent to a cloud of points defined by the ends of these vectors in this

hyperspace such that images that share similar features will correspond to points in the

cloud that are close to each other. A pixel-by-pixel comparison of all images, as is performed in maximum likelihood methods, is very slow and computationally costly. PCA can instead reduce the number of variables describing the image stack data, finding a minimum set of uncorrelated variables in the hyperspace, that is, the principal components. Methods involving multivariate statistical analysis (MSA) can be used to

identify the principal components in the hyperspace; in essence, these generate an estimate

for the covariance matrix from a set of images based on pairwise comparisons of images and

calculate the eigenvectors of the covariance matrix.

These eigenvectors are a much smaller subset of the raw data and correspond to the regions of greatest variation in the image set. Each image can then be depicted in compressed form as the linear sum of a given set of the identified eigenvectors, referred to as an eigenimage, with relatively little loss of information, since the small variations that are discarded are in general due to noise.
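As an illustrative sketch of such an eigenimage (a simplified stand-in for a full MSA pipeline; the function name and toy data are assumptions), a single image can be approximated as the stack mean plus a linear sum of the leading eigenvectors:

```python
import numpy as np

def eigenimage(stack, image, k):
    """Approximate `image` as the stack mean plus a linear sum of the k
    leading eigenvectors of the stack's pixel covariance matrix."""
    n = stack.shape[0]
    X = stack.reshape(n, -1).astype(float)
    mean = X.mean(axis=0)
    # SVD of the centred stack: rows of Vt are the covariance eigenvectors,
    # ordered by decreasing singular value (i.e. decreasing variation).
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    coeffs = Vt[:k] @ (image.ravel().astype(float) - mean)
    return (mean + Vt[:k].T @ coeffs).reshape(image.shape)
```

Increasing k gives a monotonically better (never worse) approximation, mirroring the trade-off between compression and image quality described above.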

PCA is particularly useful in objectively determining different image classes from a given set, on the basis of the different combinations of weights of the associated eigenvectors.

These different image classes can often be detected as clusters in m-​dimensional space where

m is the number of eigenvectors used to represent the key features of the image set. Several

clustering algorithms are available to detect these, a common one being k-means, which partitions points in m-dimensional space into clusters by iteratively assigning each point to its nearest cluster centroid and updating each centroid as the mean of its assigned points. Such methods have been used, for example, to recognize

different image classes in EM and cryo-EM images in particular (see Chapter 5) for molecular complexes trapped during the sample preparation process in different metastable conformational states. By thus building up different image classes from a population of several images, averaging can be performed separately within each image class to generate often exquisite detail of molecular machines in different states. From simple Poisson sampling

statistics, averaging across n such images in a given class reduces the noise on the averaged image by a factor of ~√n. By stitching such averages from different image classes together,

a movie can often be made to suggest actual dynamic movements involved in molecular

machines. This was most famously performed for the stepping motion of the muscle protein

myosin on F-​actin filaments.
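A hypothetical end-to-end sketch of this classify-then-average idea, with a minimal k-means in place of a full MSA pipeline and synthetic "conformational states" standing in for real micrographs:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: assign each point to its nearest centroid, move each
    centroid to the mean of its assigned points, and repeat until stable."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)          # nearest-centroid assignment
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

def class_averages(stack, labels, k):
    """Average the images within each detected class; with n images per class
    the residual noise on each average falls as ~1/sqrt(n)."""
    return [stack[labels == j].mean(axis=0) for j in range(k)]
```

With, say, 25 noisy copies of each of two states, the within-class averages suppress the noise by roughly √25 = 5 relative to a single raw image.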

Algorithms involving the wavelet transform (WT) are being increasingly developed to

tackle problems of denoising, image segmentation, and recognition. For the latter, WT is a

complementary technique to PCA. In essence, WT provides an alternative to the DFT but

decomposes an image into two separate spatial frequency components (high and low) along each image axis. These two components can then be combined pairwise to give four separate resulting image outputs.
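A minimal sketch of this decomposition, using a single-level Haar wavelet (one simple choice of wavelet family; practical pipelines may use others) to produce the four subband outputs:

```python
import numpy as np

def haar2d(image):
    """One level of a 2-D Haar wavelet transform: split the image into low-
    and high-frequency halves along rows, then along columns, giving four
    subband outputs (low-low, low-high, high-low, high-high)."""
    x = image.astype(float)
    # rows: pairwise averages (low frequency) and differences (high frequency)
    lo = (x[:, 0::2] + x[:, 1::2]) / 2
    hi = (x[:, 0::2] - x[:, 1::2]) / 2
    # columns: the same split applied to each half
    ll = (lo[0::2] + lo[1::2]) / 2
    lh = (lo[0::2] - lo[1::2]) / 2
    hl = (hi[0::2] + hi[1::2]) / 2
    hh = (hi[0::2] - hi[1::2]) / 2
    return ll, lh, hl, hh
```

Each output is a quarter-size image: the low-low band is a smoothed thumbnail, while the other three bands pick out horizontal, vertical, and diagonal detail.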

The distribution of pixel intensities in each of these four WT output images represents a

compressed signature of the raw image data. Filtering can be performed by blocking different